Introduction to Information Retrieval 


INF 141 / CS 121
Donald J. Patterson 


Content adapted from Hinrich Schutze 


http://www.informationretrieval.org


Robust Crawling 


A Robust Crawl Architecture 

Duplicate Elimination

• For a one-time crawl
  • Test to see if an extracted, parsed, filtered URL
    • has already been sent to the frontier.
    • has already been indexed.
• For a continuous crawl
  • See full frontier implementation:
  • Update the URL's priority
    • Based on staleness
    • Based on quality
    • Based on politeness
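
As a minimal sketch of the one-time-crawl test above, a duplicate eliminator can keep a set of URLs already sent to the frontier. The `normalize` helper and the `DuplicateEliminator` class are illustrative names, not part of any crawler described in these slides, and the normalization shown is deliberately simplistic.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host and drop the fragment (a simplification)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


class DuplicateEliminator:
    """Remembers every URL that has already been sent to the frontier."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


dedup = DuplicateEliminator()
extracted = [
    "http://Example.com/a",
    "http://example.com/a#top",  # same page as the first URL
    "http://example.com/b",
]
to_frontier = [u for u in extracted if dedup.is_new(u)]
```

At web scale the set would likely hold fixed-size fingerprints rather than full URL strings, but the membership test stays the same.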


Distributing the crawl 


• The key goal for the architecture of a distributed crawl is cache locality
• We want multiple crawl threads in multiple processes at multiple nodes for robustness
  • Geographically distributed for speed
• Partition the hosts being crawled across nodes
  • Hash typically used for partition
• How do the nodes communicate?
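
The hash partition above might look like the following sketch. It assumes a fixed number of nodes and uses SHA-1 only as a stable hash (Python's built-in `hash()` is salted per process, so it would not agree across nodes). The function name `node_for_url` is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL's host to a node index in the range [0, num_nodes)."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Every URL from the same host lands on the same node.
node_a = node_for_url("http://example.com/a", 4)
node_b = node_for_url("http://EXAMPLE.com/b?x=1", 4)
```

Because the partition key is the host rather than the full URL, each node can keep per-host state (DNS results, robots.txt, politeness timers) local.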


Robust Crawling 


The output of the URL Filter at each node is sent to the Duplicate 
Eliminator at all other nodes 

URL Frontier

• Freshness
  • Crawl some pages more often than others
    • Keep track of change rate of sites
    • Incorporate sitemap info
• Quality
  • High quality pages should be prioritized
    • Based on link-analysis, popularity, heuristics on content
• Politeness
  • When was the last time you hit a server?


URL Frontier 
• Freshness, Quality and Politeness
  • These goals will conflict with each other
• A simple priority queue will fail because links are bursty
  • Many sites have lots of links pointing to themselves, creating bursty references
  • Time influences the priority
• Politeness Challenges
  • Even if only one thread is assigned to hit a particular host, it can hit it repeatedly
  • Heuristic: insert a time gap between successive requests
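
The time-gap heuristic can be sketched as a small per-host gate. The class name `PolitenessGate`, the injectable `clock` parameter, and the 2-second gap are assumptions made for illustration only.

```python
import time


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_gap_seconds, clock=time.monotonic):
        self.min_gap = min_gap_seconds
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, host):
        """Seconds the caller should still wait before hitting this host."""
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, host):
        """Call right after a request to start the gap for this host."""
        self.next_allowed[host] = self.clock() + self.min_gap


# A fake clock makes the behavior easy to follow without sleeping.
now = [100.0]
gate = PolitenessGate(2.0, clock=lambda: now[0])
gate.record_fetch("a.example")
now[0] = 101.5
remaining = gate.wait_time("a.example")  # 0.5 seconds left in the gap
```

A thread would sleep for `wait_time(host)` before each request, which keeps even a single thread from hitting one host repeatedly.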

Magnitude of the crawl

• To fetch 1,000,000,000 pages in one month...
  • a small fraction of the web
• we need to fetch 400 pages per second!
• Since many fetches will be duplicates, unfetchable, filtered, etc., 400 pages per second isn’t fast enough
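
The arithmetic behind the 400 pages per second figure, assuming a 30-day month. The `useful_fraction` value is a made-up number, used only to show how wasted fetches raise the required rate.

```python
PAGES = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

rate = PAGES / SECONDS_PER_MONTH
print(f"{rate:.0f} pages per second")  # about 386, rounded up to 400 on the slide

# Hypothetical: if only half of all fetches yield a new, usable page,
# the raw fetch rate has to double to stay on schedule.
useful_fraction = 0.5
required = rate / useful_fraction
print(f"{required:.0f} fetches per second")
```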


Web Crawling Outline 


Overview
• Introduction
• URL Frontier
• Robust Crawling
• DNS
• Various parts of architecture
• URL Frontier
• Index
• Distributed Indices

Connectivity Servers



URL Frontier Implementation - Mercator

• URLs flow from top to bottom
• Front queues manage priority
• Back queues manage politeness
• Each queue is FIFO

[Figure: F "Front" Queues, B "Back" Queues, Timing Heap]


URL Frontier Implementation - Mercator 


[Figure: Front queues. Prioritizer, F "Front" Queues, Front Queue Selector]

• Prioritizer takes URLs and assigns a priority
  • Integer between 1 and F
  • Appends URL to appropriate queue
• Priority
  • Based on rate of change
  • Based on quality (spam)
  • Based on application
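
A prioritizer along these lines could be sketched as follows. The scoring rule (averaging a change-rate score and a quality score, both in [0, 1]) is invented for illustration, as is the convention that a larger number means higher priority; the slides only say the priority is an integer between 1 and F.

```python
from collections import deque

F = 3  # number of front queues (illustrative)
front_queues = {p: deque() for p in range(1, F + 1)}  # each queue is FIFO


def assign_priority(change_rate, quality):
    """Hypothetical scoring: average two scores in [0, 1], scale to 1..F."""
    score = (change_rate + quality) / 2
    return min(F, 1 + int(score * F))


def enqueue(url, change_rate, quality):
    """Append the URL to the front queue matching its priority."""
    priority = assign_priority(change_rate, quality)
    front_queues[priority].append(url)
    return priority


high = enqueue("http://news.example/", change_rate=0.9, quality=0.9)
low = enqueue("http://static.example/", change_rate=0.0, quality=0.1)
```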


URL Frontier Implementation - Mercator 


[Figure: Back queues. Mapping Table, B "Back" Queues]

• Selection from front queues is initiated from back queues
• Pick a front queue, how?
  • Round robin
  • Randomly
  • Monte Carlo
  • Biased toward high priority
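
One way to combine "randomly" with "biased toward high priority" is a weighted random draw over the non-empty front queues. This sketch assumes that a larger queue number means higher priority and uses the number itself as the weight; both choices are illustrative, and `pick_front_queue` is a hypothetical name.

```python
import random
from collections import deque


def pick_front_queue(front_queues, rng=random):
    """Pick a non-empty front queue at random, weighted by its priority number."""
    candidates = [p for p, q in front_queues.items() if q]
    if not candidates:
        return None
    return rng.choices(candidates, weights=candidates, k=1)[0]


queues = {1: deque(["u1"]), 2: deque(), 3: deque(["u3"])}
rng = random.Random(0)  # seeded so repeated runs behave the same
picks = [pick_front_queue(queues, rng) for _ in range(10_000)]
# Queue 3 should be chosen roughly three times as often as queue 1;
# the empty queue 2 is never chosen.
```

Unlike a strict "always take the highest priority" rule, the weighted draw still gives low-priority queues some service, so they are not starved.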


URL Frontier Implementation - Mercator 


[Figure: Back queues. Mapping Table, B "Back" Queues]

• Each back queue is non-empty while crawling
• Each back queue has URLs from one host only
• Maintain a table from hosts to back queues (the mapping table) to help keep each host in a single queue


URL Frontier Implementation - Mercator 


[Figure: Mapping Table, B "Back" Queues]

• Timing Heap
  • Has earliest time that a host can be hit again
  • Earliest time based on
    • Last access to that host
    • Plus any appropriate heuristic
      • robots.txt “crawl-delay”
      • sitemaps instruction
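
A rough sketch of the timing heap: each entry is an (earliest time, host) pair, and the earliest time is the last access plus a gap taken from the robots.txt crawl-delay when one is present. The user agent `examplebot`, the fallback gap of 1 second, and the helper names are assumptions; the robots.txt lines are passed in directly so the example needs no network access.

```python
import heapq
from urllib.robotparser import RobotFileParser

DEFAULT_GAP = 1.0  # seconds; an assumed fallback when no crawl-delay is given


def gap_for(robots_lines, user_agent="examplebot"):
    """Return the crawl-delay from robots.txt lines, or the default gap."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_GAP


def schedule(heap, host, last_access, gap):
    """Record the earliest time this host can be hit again."""
    heapq.heappush(heap, (last_access + gap, host))


heap = []
slow_gap = gap_for(["User-agent: examplebot", "Crawl-delay: 5"])
fast_gap = gap_for(["User-agent: examplebot", "Disallow: /private"])
schedule(heap, "slow.example", last_access=100.0, gap=slow_gap)
schedule(heap, "fast.example", last_access=100.0, gap=fast_gap)

earliest_time, host = heap[0]  # the heap root is the next host allowed
```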


URL Frontier Implementation - Mercator 


[Figure: Mapping Table, B "Back" Queues]

• A crawler thread needs a URL
  • It gets the timing heap root
  • It gets the next eligible queue based on time, b.
  • It gets a URL from b
  • If b is empty
    • Pull a URL v from front queue
    • If back queue for v exists, place it in that queue, repeat.
    • Else add v to b, update heap.
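
The steps above can be sketched in a single-threaded toy frontier. This is a simplification under stated assumptions, not Mercator itself: there is only one front queue, a fixed `gap` stands in for the politeness heuristics, and `next_url` returns the time a URL may be fetched instead of sleeping. The class and method names are hypothetical.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    def __init__(self, front_urls, num_back_queues, gap=1.0):
        self.front = deque(front_urls)  # one front queue, for brevity
        self.back = [deque() for _ in range(num_back_queues)]
        self.host_to_queue = {}  # mapping table: host -> back queue index
        self.heap = []  # timing heap of (earliest allowed time, queue index)
        self.gap = gap
        for b in range(num_back_queues):
            self._refill(b, ready_at=0.0)

    def _refill(self, b, ready_at):
        """Pull URLs from the front queue until back queue b gets a new host."""
        while self.front:
            v = self.front.popleft()
            host = urlsplit(v).hostname
            if host in self.host_to_queue:
                # A back queue for this host exists: place v there and repeat.
                self.back[self.host_to_queue[host]].append(v)
            else:
                # Otherwise b now belongs to this host; update the heap.
                self.host_to_queue[host] = b
                self.back[b].append(v)
                heapq.heappush(self.heap, (ready_at, b))
                return

    def next_url(self, now):
        """Return (url, time it may be fetched), or None if nothing is left."""
        if not self.heap:
            return None
        ready_at, b = heapq.heappop(self.heap)  # timing heap root
        url = self.back[b].popleft()
        fetch_time = max(now, ready_at)
        if self.back[b]:
            heapq.heappush(self.heap, (fetch_time + self.gap, b))
        else:
            # b is empty: free its host and refill it from the front queue.
            # Reusing the gap here is conservative, in case the next URL
            # pulled belongs to the same host again.
            del self.host_to_queue[urlsplit(url).hostname]
            self._refill(b, ready_at=fetch_time + self.gap)
        return url, fetch_time


frontier = Frontier(
    ["http://a.example/1", "http://a.example/2", "http://b.example/1"],
    num_back_queues=2,
    gap=2.0,
)
order = [frontier.next_url(now=0.0) for _ in range(3)]
```

In this run the second URL from `a.example` is only allowed at time 2.0, while the URL from `b.example` can be fetched immediately in between.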


URL Frontier Implementation - Mercator 


[Figure: Mapping Table, B "Back" Queues]

• How many queues?
  • ~3 times as many back queues as crawler threads
• Web-scale issues
  • This won't fit in memory
  • Solution
    • Keep queues on disk and keep a portion in memory.


URL Frontier Implementation - Mercator - walk through the process 

The index

• Why does the crawling architecture exist?
  • To gather information from web pages (aka documents).
• What information are we collecting?
  • Keywords
    • Mapping documents to a “bag of words” (aka vector space model)
  • Links
    • Where does a document link to?
    • Who links to a document?

The index has a list of vector space models
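
A bag of words can be sketched as a word-count mapping. The tokenizer below (lowercase, split on anything that is not a letter or digit) and the sample sentence are assumptions for illustration only.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts, ignoring case, order and punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


doc = "Breaking report: a racing report"  # hypothetical document text
bag = bag_of_words(doc)
# bag["report"] is 2; words that never occur count as 0.
```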

[Figure: example document ("Justin Bieber was drag racing in a ...") shown with its word counts]

Our index is a 2-D array or Matrix

A Column for Each Web Page (or “Document”)
A Row For Each Word (or “Term”)

“Term-Document Matrix” Captures Keywords

A Column for Each Web Page (or “Document”)
A Row For Each Word (or “Term”)
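
A tiny term-document matrix, built as a list of rows (one per term) with one column per document. The two documents and the whitespace tokenization are made up for illustration.

```python
from collections import Counter

docs = {  # hypothetical documents
    "d1": "web crawl web",
    "d2": "crawl index",
}

counts = {d: Counter(text.split()) for d, text in docs.items()}
doc_ids = sorted(docs)  # column order
terms = sorted(set().union(*counts.values()))  # row order

# matrix[r][c] is how often terms[r] occurs in doc_ids[c].
matrix = [[counts[d][t] for d in doc_ids] for t in terms]

for term, row in zip(terms, matrix):
    print(f"{term:6s} {row}")
```

Each column of the matrix is the bag of words for one document; each row shows which documents contain a given term.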

The Term-Document Matrix

• Is really big at web scale
• It must be split up into pieces
• An effective way to split it up is to split it the same way as the crawling
  • Equivalent to taking vertical slices of the T-D Matrix
  • Helps with cache hits during crawl
• Later we will see that it needs to be rejoined for calculations across all documents
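
Taking vertical slices means giving each node a block of columns (documents) while every node keeps all rows (terms). In this sketch the `assign` function stands in for whatever rule partitions the crawl; the rule used here, the document names, and the counts are hypothetical.

```python
def vertical_slices(matrix, doc_ids, assign):
    """Split a term-document matrix into column blocks, one per node."""
    slices = {}
    for col, doc in enumerate(doc_ids):
        node = assign(doc)
        part = slices.setdefault(node, {"docs": [], "rows": [[] for _ in matrix]})
        part["docs"].append(doc)
        for r, row in enumerate(matrix):
            part["rows"][r].append(row[col])
    return slices


matrix = [  # rows are terms, columns are documents
    [1, 1, 0],
    [0, 1, 2],
]
doc_ids = ["a.example/1", "a.example/2", "b.example/1"]
slices = vertical_slices(matrix, doc_ids, lambda doc: 0 if doc.startswith("a.") else 1)

# A calculation across all documents has to rejoin the slices,
# for example the total count of the first term:
total_first_term = sum(sum(part["rows"][0]) for part in slices.values())
```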